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Automatic anomaly detection in high energy collider data 
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We address the problem of automatic anomaly detection in high energy collider data. Our ap- 
proach is based on the random generation of analytic expressions for kinematical variables, which 
can then be evolved following a genetic programming procedure to enhance their discriminating 
power. We apply this approach to three concrete scenarios to demonstrate its possible usefulness, 
both as a detailed check of reference Monte-Carlo simulations and as a model independent tool for 
the detection of New Physics signatures. 



I. INTRODUCTION 

The analysis of data produced by existing and forth- 
coming high energy collider experiments is notoriously 
challenging, both from theoretical and experimental per- 
spectives. The situation worsens with hadron collisions 
at high luminosity due their intrinsic complexity and the 
huge amount of events to be considered. Existing ef- 
forts to face those challenges and isolate signal signa- 
tures above predominant backgrounds generally belong 
to either of two categories. On the one hand, top-down 
approaches take advantage of the prior knowledge of spe- 
cific theoretical framework (e.g., the Standard Model, 
or a particular supersymmetric scenario) to predict and 
simulate interesting signatures in order define advanced 
and efficient isolation strategies. Those strategies can 
then be applied to real data and the results readily in- 
terpreted in terms of (model dependent) constraints on 
a given parameter space. On the other hand, bottom- 
up approaches first focus on data through simple physi- 
cal observables (e.g., invariant masses, sum of transverse 
momenta) in order to identify possible deviations from 
background expectations. 

Two important remarks can be made at this stage. 
First in both cases our ability to precisely simulate back- 
grounds and thus, indirectly, to carefully validate Monte- 
Carlo predictions from various sources, is a crucial ingre- 
dient for discovery. Second, the nature of some of the 
most relevant signals could very well be such that both 
types of approach initially fail to identify them. This 
is particularly true if relevant signatures do not present 
enough similarities with those studied in the context of 
existing top-down analysis, and, at the same time, are 
sufficiently complex not to lead to statistically signifi- 
cant variations in the limited set of physical observables 
considered in bottom-up approaches. 

The present work aims at addressing those questions 
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by introducing a new method to automatically detect 
and characterize anomalies (i.e., statistically significant 
deviations from a given reference) both in Monte-Carlo 
samples and real data. This method makes use of the 
genetic programming technique introduced by Koza [l[ 
(see [2] for a global overview and B for examples of ap- 
plications in High Energy Physics \W\), and used recently 
to automatically "derive" fundamental conservation laws 
from chaotic pendulum data 4]. Contrary to top-down 
approaches, no prior assumption is made on the nature of 
the anomalies to be isolated and, contrary to bottom-up 
approaches, the technique is not limited to a small sub- 
set of physical quantities but can take advantage of the 
discriminating property of more complex observables. 

This letter is organized into three parts. First, we de- 
fine the anomaly detection method based on the genetic 
programming technique. Second, we apply this method 
to three simple yet relevant analysis scenarios at the LHC 
to demonstrate its usefulness both as a test of Monte- 
Carlo simulations and as a model independent tool for 
early stage New Physics discovery. Finally, we comment 
on the obtained results and discuss prospects to general- 
ize our approach to a wider class of final state configura- 
tions. 



II. THE METHOD 

We start by defining a set (also called population) 
of formulas (ft randomly built on a set of terminals 
(typically, variables such as the four-momenta of recon- 
structed objects in the detectors) and a set of functions 
(e.g., addition, invariant mass). Each of those formula is 
represented by a syntax tree, as shown in Figure [TJ The 
initial method for the random generation of the formula 
set is arbitrary, but a popular choice is a mixture of the 
so-called "full" and "grow" algorithms leading to trees 
with branches of constant or varying lengths. Each func- 
tion and terminal is strongly typed such that the resulting 
formulas are built by combining quantities with physical 
units in a meaningful way. 

After the initialization procedure, the fitness of the 
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FIG. 1: Syntax tree for the example formula ip = m(pi + 
P2)A(^(pi,P3), where pi are ordered particle 4-momenta and 
m and A<^ represent respectively the invariant mass and the 
difference in azimuthal angles functions. The function and 
terminal sets are in this case {+, x,m, A^} and {pi,p2,P3}. 
The depth, i.e., the length of the longest branch, is 4. 



generated formulas, i.e., their ability to highlight dis- 
crepancies between a considered set of data and a given 
sample of events considered as a reference, is evaluated. 
Assuming that both the data and the reference sample 
are normalized to a certain number of events and, for a 
given formula are distributed over a certain range in 
n bins of equal lengths denoted d\l\ and r[Z], we propose 
to use 




(1) 



as a naive estimator of the formulas ip fitness when the 
statistical uncertainty over the reference r[Z] is negligi- 
ble. If the formulas pi are random and independent, and 
the number of considered events sufficiently large, this 
estimator initially follows, by definition, a distribu- 
tion with n degrees of freedom. In practice, in order to 
deal with bins in which the observed number of signal 
events is very low (i.e., below a certain threshold cr^j„), 
the denominator of expression ([Ij can be replaced by 
max{yj d[V\, amm) to avoid artificially large contributions. 

The next step is to make the initial population evolve. 
A new population of identical size is generated by creat- 
ing new formulas according to three distinct reproduction 
mechanisms. First, given two parent formula, the subtree 
crossover randomly and independently selects a crossover 
node in each parent tree and creates an offspring by re- 
placing the subtree rooted at the crossover point in the 
first parent with the subtree rooted at the crossover point 
in the second parent. In order to avoid the creation of 
offsprings with arbitrarily long depths, the initial node 
selection can be adjusted such that the offspring depth 
always remains smaller than a fixed maximal value. Sec- 
ond, the mutation which corresponds to a crossover be- 
tween a parent formula and another formula, external to 
the initial population and generated randomly. Finally, 
the copy mechanism exactly reproduces a parent formula. 

Each mechanism has a certain probability to happen 

Pcross, Pmut and Pcopy, with, typically, Pcross » Pmut > 



Pcopy To ensure a constant selection pressure, parents 
with an higher fitness are selected more frequently to 
take part in the reproduction. The presence of three 
distinct reproduction mechanisms satisfies three crucial 
requirements. Formulas with the best discrepancy power 
should share their "genetic inheritance" more often, they 
should have the opportunity to exactly duplicate (to 
avoid the disappearance of the best candidates) and a 
certain amount of randomness should be present to de- 
crease the sensitivity on initial conditions (i.e., an initial 
population with a limited size). 

The fitness evaluation and evolution steps are repeated 
alternatively during a given fixed number of iterations. 
The best formula is identified for each iteration, and the 
best candidate for all iterations is reported. Alterna- 
tively, the whole last population, or a subset of it, can 
also be reported as worth considering alternatives. The 
fitness of the best individual(s) allows us to es- 

timate of the actual significance of the observed discrep- 

ancy(ies). First, the p-value Pbest associated to Xv'"^'^*'' 
is calculated assuming a normal distribution with n 
degrees of freedom. This value cannot be directly inter- 
preted as a statistical deviation since the total popula- 
tion, i.e., all the formulas generated during Nuer iter- 
ations, is not randomly distributed and biased by con- 
struction towards high x^ values. To circumvent this 
difficulty, we calculate the p-value Pref associated with 

the value x'v ^'^'^'^ obtained by applying multiple times the 
same algorithm on a signal- free reference sample contain- 
ing the same number of events as the data, and by av- 
eraging the results. Finally, we propose to use the ratio 
Pbest / Pref 1 which correspouds to the probability of ob- 
taining the value x%i ^ Xv a subsample of formu- 
las with x%i ^ xip^"^^^ as an estimate of the discrepancy 
significance. 



III. APPLICATIONS 

To illustrate the proposed technique, we apply it to 
three different simple analysis scenarios. Applicability 
of the method to a wider class of objects and to more 
realistic data are discussed in the next section. The re- 
sults have been produced using a prototype genetic pro- 
graming library [Sj running on a single desktop machine. 
The algorithm parameters are kept constant to empha- 
size process independence and allow or an easy compar- 
ison of the results. The population size is 50. Formulae 
have an initial depth of 4 which can increase up to 5. The 
number of iteration is fixed to 10 and the reproduction 
mechanism probabilities defined in the previous section 
are set to Peross = 0.75, Pmut = 0.20 and Pcopy = 0.05. 
The terminals are the considered particle four-momenta 
and and the function set is 



/ = {+, X , TO, Ar/, A0, Ai?, cos 0} 



(2) 
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where m is the invariant mass, Arj and A(j) the dif- 
ferences in pseudo-rapidity and azimutal angle, AR = 
^jArf- + and cos 9 is the cosine of the angle between 
the particle direction and the beam axis. The fitness es- 
timator ^ is computed for 10 bin distributions. To es- 
timate the statistical error for the output, it has been 
applied 100 times on different samples and the (gaussian 
distributed) results have been summarized in terms of 
mean and standard deviation. 



A. Spin correlation in top quark decay 

Our first example aims at demonstrating how the 
method can be used to quantitatively compare two 
Monte-Carlo simulations supposedly describing the same 
process (es). We confront two samples containing top 
quark pairs decaying into a full Icptonic final state, 
namely pp ^ it /i+fp6/x^I7^6. The parton-level of 
the first saniple is simulated with MadGrapli/MadEvent 
(MG/ME) Q using the exact 2 — >■ 6 matrix element. 
The second sample is obtained using Pythia only, with 
the exact 2 — > 2 matrix element for the top quark pair 
production, but generating the six particle final state 
through a flat phase space decay. Obviously, the two 
samples would gave similar results for simple quantities 
such as transverse momenta or invariant masses. How- 
ever, they would differ significantly when considering spin 
dependent observables such as particular angular distri- 
butions. 

Applying the method described above, we obtain a fit- 
ness estimator of 47.7 ± 9.8 when comparing the exact 
2 — > 6 matrix element sample with a fiat phase space de- 
cay sample, to be compared to a value of 12.3 ± 2.2 when 
comparing two different samples implementing the same 
flat phase space decay. This translates into a probability 
larger than 99% that the two samples are based on physi- 
cally incompatible descriptions of the same process. Note 
that the most discriminating individuals all contain ex- 
plicit references to angular variable distributions such as 
A<j), pointing out to the exact nature of the discrepancy. 



B. W' boson and SUSY decay chain 

In the following examples, we consider two cases. The 
flrst is the s-channel production of a heavy (1 TeV) W 
gauge boson, decaying as W WZ — J> to 
give a particularly clean three muons final state. The sec- 
ond process considered yields an identical final state, i.e., 
three leptons, but it originates from a longer, supersym- 
metric decay chain -J> {xt v^jil v^^i+xDixl 
IJ.~fi^ H~H'^Xi), where we assumed = m^o — 

500 GeV, m~+ = 202 GeV and m-^- = 144 GeV. 

The method is applied to the comparison of two sam- 
ples, each containing a mixture of one of these BSM sig- 
nals with the associated irreducible SM background, with 




FIG. 2: Fitness as a function of S/{S + B) for the W 
WZ — > 3/i (red) and for the pp (xl '^mMl 
i'nM'''Xi)(X2 -5> M"^^Xi) (blue) signals. The red 

(blue) line and the associated band indicate the mean and 
95% deviation for the fitness of the best function respectively. 
The dotted line indicates the 95% p- value threshold computed 
using a signal- free sample, as described in the text. 

a variable S/{S + B) ratio, and a sample containing only 
the SM background pp — > WZ ^^^^ pL^v. The 

event production consists of to the parton-level output 
of MG/ME. The relation between the average fitness of 
the best individuals after a fixed number of iteration and 
the S/{S + B) ratio is shown in Figured For the W 
case the signal remains unvisible below an Sj {S + B) 
ratio of approximatively 7%. The slightly higher fitness 
values observed in the 4-6% region are not large enough 
to be statistically significant, considering the large vari- 
ation of the results in the signal free region. Above 7%, 
the average fitness is large enough to provide a statis- 
tically significant discrimination of the signal over the 
background. As expected, all the most discriminant dis- 
tributions contain combinations proportional to the total 
invariant mass of the three final state leptons. 

In the SUSY case, the presence of a signal can be de- 
tected at a 95% C.L. for S/{S + B) > 10%, which is 
not very different from the result one would expect from 
classical cut based studies using similar data. It turns 
out that the most discriminant distributions are in fact 
mainly based on the total invariant mass of the three final 
state leptons, but a large fraction of them also contain 
information related to their angular distribution. 



IV. DISCUSSION 

We discuss the merits of our proposal in the context of 
three questions: 

How does it compare with approaches using a fixed set 
of distributions ? As perhaps expected by construction, 
our approach can help isolating discrepancies with larger 
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statistical significance than strategies based on a limited 
fixed set of discriminating distributions Q. However, 
the two types of approach remain complementary with, 
on one hand, faster calculations and an easier control of 
systematic uncertainties in the context of real data analy- 
sis and, on the other hand, better prospects for detection 
of anomalies with a complex structure (e.g., involving 
kinematic information of more than two objects). 

How does it compare with random searches ? A com- 
mon criticism leveled at genetic programing algorithms 
is their possibly low efficiency when compared to a sim- 
ple random search over the space of tree formula. Our 
tests have confirmed that similar results can be achieved 
with random searches for simple cases. However, for 
more complex cases (such as our SUSY example) ran- 
dom searches took significantly longer (5-10 times) CPU 
time to isolate the best discriminating formula, which of- 
ten appear longer and unnecessarily more complex. This 
lets us postulate that particularly complex cases, such 
as many particle final states, could only be handled by 
techniques implementing an optimization procedure. Fi- 
nally, the proposed genetic programming solution allows 
to optimize a complete set of formula among which differ- 
ent variations can be finally preferred based on simplicity 
over the most discriminating one, where a random search 
would provide only a few relevant solutions. 

Can it be generalization to more complex objects with 
systematic uncertainty propagation ? The examples we 
presented only use muon momentum information such 
that, in first approximation, systematic reconstruction 
uncertainties can be safely neglected compare to statisti- 
cal fluctuations. However, when considering more com- 
plex objects (jets, missing Et), effects like finite energy 
and spatial resolutions, reconstruction algorithm limita- 
tions, and various other detector effects cannot be over- 
looked. We would suggest to use transfer functions (such 
as in Matrix Element based techniques to propagate 
errors associated, for example, with jet property mea- 
surements. The resulting additional error would then be 
naturally convoluted with the intrinsic statistical error 
associated with the method. This approach would nat- 
urally lead to a serious increase in the calculation com- 



plexity, which should however be well-manageable using 
a dedicated library written using a fast, compiled lan- 
guage. 



V. CONCLUSION 

We have proposed a new approach to automatize 
anomaly detection in high energy collider data. Our 
method is based on the random generation of analytic 
expressions for complex kinematical variables, which are 
then evolved using a genetic programming procedure to 
enhance their discriminating power. We successfully ap- 
plied this method in three different test cases and demon- 
strated its potential usefulness for both Monte Carlo sam- 
ple validation and BSM anomaly detection. 

Besides those already mentioned, we believe there are 
several prospects for improvement and generalization of 
our approach. First, though it is a model independent 
anomaly detection technique, it might well be helpful at 
a later stage of a classical "top down" analysis to build 
(instead of guessing) the optimal, most discriminating 
distribution variable after several acceptance and analy- 
sis cuts have been applied. Second, a similar technique 
could be used to improve the "blind" generation of BSM 
signatures in the context of effective scenarios (e.g., see 
|10|). Finally, we hope genetic algorithms would become 
part of pragmatic solutions to determine the best BSM 
model fitting a restricted set of data in the context of the 
so-called "LHC inverse problem" . Work along those lines 
is already in progress. 
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